Abstract. Identifying social actors has become one of the tasks in Artificial
Intelligence, and extracting keywords from Web snippets, which depends on the
use of the Web, is steadily gaining ground in this research. We therefore
develop an approach based on the overlap principle for utilizing a collection
of features in web snippets, where the use of keywords eliminates irrelevant
web pages.

1 Introduction

One of the tasks in extracting information about social actors, such as in
social network extraction, is the identification of the appropriate actor [1].
Today, with the information explosion on the Internet, it is difficult to
associate web pages with the intended social actor correctly and precisely,
mainly because of semantic relations such as synonymy and polysemy. For
example, in DBLP the author name "Tengku Mohd Tengku Sembok" is sometimes
written as "Tengku Mohd Tengku Sembuk" or "Tengku M. T. Sembok", and the actor
named "Shahrul Azman Mohd Noah" also has the name label "Shahrul Azman Noah".
These are two of many cases in which a social actor has multiple name
variations or abbreviations in citations across publications. Conversely,
names like "Michael D. Williams" and "Mark D. Williams" both appear under the
label "MD Williams" in some citations. This is a case in which different
social actors share the same name label in multiple citations.

Most work has addressed the two problems above as name disambiguation or
co-reference resolution. Examples include preparing person-specific
information [2], finding associations between persons [3], distinguishing
different persons with keywords or key phrases [3], and using the domain of
research paper citations [5]. However, few works attempt to extract flexible
features of a person as keywords for extracting social networks from the Web,
or to deal with the information explosion on the Internet, which continuously
widens the gap in the relationship between actors and words. This paper aims
to address keywords for identifying social actors by exploring Web snippets as
a source of flexible features.

2 Problem Definition 

In semantics, disambiguation is the process of identifying the essence of a
word: the sense attached to objects or social actor names, and the meaning
embedded in it by how societies use it. These meanings are found in
dictionaries, which were written based on events within society and are
distributed across web pages, while sometimes one event exists within another
event. Therefore, disambiguation issues are subject to a kind of overlap
principle, i.e. a paradigm for understanding an event in the world through its
intersecting parts, which distinguishes one actor as a special case of WSD
(word sense disambiguation) [6], especially regarding a person [7]. The
motivations of the disambiguation problem are as follows.

1. Meronymy [8,9]: x is part of y or "is-a", part-to-whole relation - the
semantic relation that holds between a part and the whole. In other words, the
page for x belongs to the categories of y. For example, the page for Barack
Obama in Wikipedia belongs to the categories (a) President of the United
States, (b) United States Senate, (c) Illinois Senate, (d) Black people, etc.
In other cases, a social actor is associated with one or more categories. For
example, Noam Chomsky is a linguist and Noam Chomsky is also a critic of
American foreign policy.

2. Holonymy: x has y as a part of itself or "has-a", whole-to-part relation -
the semantic relation that holds between a whole and its parts. For example,
in DBLP, the author name "Shahrul Azman Mohd Noah" has the name label
"Shahrul Azman Noah".

3. Hyponymy [10]: x is subordinate to y or "has-property", subordination -
the semantic relation of being subordinate or belonging to a lower rank or
class. In other words, the page for x has subcategories y. For example, the
homepage of Tengku Mohd Tengku Sembok has the category pages Home, Biography,
Curriculum Vitae, Gallery, Others, Contact, Links, etc. Some pages also
contain the name label "Tengku Mohd Tengku Sembok".

4. Synonymy [11,12,13]: x denotes the same as y - the semantic relation that
holds between two words that can (in context) express the same meaning. This
means that the name of an actor may have multiple label variations or
abbreviations in citations across publications. For example, in some web
pages, the author name "Tengku Mohd Tengku Sembuk" is sometimes written as
"T Mohd T Sembok" or "Tengku M T Sembok".

5. Polysemy [12,13]: lexical ambiguity - an individual word, phrase or label
that can be used to express two or more different meanings. This means that
different actors may share the same name label in multiple citations. For
example, both "Guangyu Chen" and "Guilin Chen" are used as the co-reference of
"G. Chen" in their citations.

Let $A$ be a set of actors, i.e. $A = \{a_i \mid i = 1, \ldots, n\}$. We can
classify some sets of actors based on the labels of actors.

1. $A_d = \{d_j \mid j = 1, \ldots, m\}$ is a set of ambiguous names which
need to be disambiguated, e.g., {"Abdullah Mohd Zin", "Shahrul Azman Mohd
Noah", ...}. Thus, $A$ is a table of reference names containing the actors
which the names in $A_d$ represent: $A$ = {"Abdullah Mohd Zin (academic)",
"Abdullah Mohd Zin (politician)", ...}.

2. $A_t = \{a_l \mid l = 1, \ldots, k\}$ is a set of compositions of name
tokens (first/middle/last name or abbreviation), which arises because an actor
has multiple name variations: "Shahrul Azman Mohd Noah (Professor)" may appear
in multiple web pages under different names such as "Noah, Shahrul Azman
Mohd", "Shahrul Azman Noah", "Shahrul A. Noah", "Noah, Shahrul A.", "Shahrul
Azman MN", and "Shahrul Azman bin Mohd Noah", where "bin" is a special token
meaning "son of".

3. $A_x = \{a_p \mid p = 1, \ldots, o\}$ is a set of actors which can be
observed in web pages (documents). These names need to be patterned and
disambiguated. An actor's name can be rendered differently in online
information sources: it does not follow a single pattern of tokens, nor is it
labeled with a unique identifier, so the names refer to uncertain persons.

Therefore, name disambiguation is an important problem in extracting
information about social actors, because a social actor can be referred to by
different aliases for several reasons: use of abbreviations, different naming
conventions, misspellings, pseudonyms in publications or bibliographies
(citations), or naming variations over time. We conclude that there are two
fundamental reasons for name disambiguation in semantics for identifying
actors, as follows:

1. There is a relation $\phi$ that assigns $\Omega$, the space of events
containing actors, to $A$: a relation $\phi_1 : \Omega \to A_x$ and a relation
$\phi_2 : A_x \to A$, $M = N_1 + N_2$, where $\phi_1$ and $\phi_2$ are
relations $\phi_t : A_t \to A$ and $\phi_d : A_d \to A$, with
$A_t, A_d \subset A_x$, such that $\phi = \phi_1\phi_2 : \Omega \to A$.

2. There is a relation $\psi$ that assigns $A$, containing $a$, to $\Omega$,
i.e. there are a relation $\psi_t : A \to A_t$ and a relation
$\psi_d : A \to A_d$, with $A_t, A_d \subset A_x$, giving a relation
$\psi_1 = \psi_t\psi_d : A \to A_x$, and there is a relation
$\psi_2 : A_x \to L$, $L \subset \Omega$, such that
$\psi = \psi_1\psi_2 : A \to L$.
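
The synonymy and polysemy cases of this section can be illustrated with a small many-to-many mapping between name labels and actors. The following Python sketch is only an illustration: the observation list is built from the name examples quoted in the text, and the helper `build_relations` and the variable names are hypothetical, not part of the proposed method.

```python
from collections import defaultdict

# Illustrative (label, actor) pairs taken from the examples in the text.
OBSERVATIONS = [
    ("Shahrul Azman Mohd Noah", "Shahrul Azman Mohd Noah"),
    ("Shahrul Azman Noah", "Shahrul Azman Mohd Noah"),
    ("Shahrul A. Noah", "Shahrul Azman Mohd Noah"),
    ("MD Williams", "Michael D. Williams"),
    ("MD Williams", "Mark D. Williams"),
]


def build_relations(observations):
    """Build both directions of the label/actor relation (hypothetical helper)."""
    label_to_actors = defaultdict(set)
    actor_to_labels = defaultdict(set)
    for label, actor in observations:
        label_to_actors[label].add(actor)
        actor_to_labels[actor].add(label)
    return label_to_actors, actor_to_labels


label_to_actors, actor_to_labels = build_relations(OBSERVATIONS)

# Polysemy: one label shared by several actors.
polysemous_labels = sorted(
    label for label, actors in label_to_actors.items() if len(actors) > 1
)
# Synonymy: one actor written under several labels.
multi_label_actors = sorted(
    actor for actor, labels in actor_to_labels.items() if len(labels) > 1
)

print(polysemous_labels)
print(multi_label_actors)
```

In this toy data the label "MD Williams" is the polysemous case, and "Shahrul Azman Mohd Noah" is the actor with several synonymous labels.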



3 The Proposed Approach 

This approach starts from the following concepts.

1. A word $w$ is the basic unit of discrete data, defined to be an item from a
vocabulary indexed by $\{1, \ldots, K\}$, where $w_k = 1$ if $k \in K$ and
$w_k = 0$ otherwise.

2. A term $t_k$ consists of at least one word, or
$t_k = \{w_1, \ldots, w_l\}$, $l \le k$, where $k$ is the number of parameters
representing words, and $|t_k| = k$ is the size of $t_k$.

3. Let a web page be denoted by $\omega$ and let the set of web pages indexed
by a search engine be $\Omega$, containing pairs of term and web page. Let
$t_x$ be a search term and let a web page containing $t_x$ be $\omega_x$; we
obtain $\Omega_x = \{(t_x, \omega_x)\}$, $\Omega_x \subset \Omega$, or
$t_x \in \omega_x \in \Omega_x$. $|\Omega_x|$ is the cardinality of
$\Omega_x$.

4. Let $t_x$ be a search term. $S = \{w_1, \ldots, w_{max}\}$ is a web snippet
(briefly, snippet) about $t_x$ returned by the search engine, where
$max = \pm 50$ words. $L = \{S_i \mid i = 1, \ldots, m\}$ is a list of
snippets.

We develop an approach for extracting keywords from web snippets based on the
concept of overlap: a concept that interprets our world as a composition of
parts, where the parts are in turn made of other parts. The overlap principle
is implemented as the intersection operator, or logically $t_a$ AND $t_b$ is
TRUE, in the space of events $\Omega$ with two conditions:

(a) Let $t_a$ and $t_b$ be search terms, $t_a \cap t_b$, and let the web pages
containing $t_a$ AND $t_b$ be $\omega_a$ and $\omega_b$, with
$\omega_a, \omega_b \in \Omega_a \cap \Omega_b$; $|\Omega_a \cap \Omega_b|$ is
the cardinality of $\Omega_a \cap \Omega_b$, and
$\Omega_a \cap \Omega_b \subset \ldots$

(b) Let $t_a, t_x, t_y \in S$ with $|\Omega_x| < |\Omega_y|$; then
$|\Omega_a \cap \Omega_x| > |\Omega_a \cap \Omega_y|$.

The overlap principle can also be implemented with a query of two objects: for
each query query($t_a$, $t_b$) submitted to the search engine, the search
engine returns a collection of web snippets as a representation of objects in
a group of web pages.
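
The intersection reading of the overlap principle can be illustrated on a toy index. The sketch below is a minimal illustration under stated assumptions: the `INDEX` dictionary, its page identifiers, and the helpers `overlap` and `hit_count` are hypothetical stand-ins for a real search engine, which would only return hit counts and snippets rather than page sets.

```python
# Hypothetical toy index: search term -> set of page identifiers (Omega_x).
INDEX = {
    "Abdullah Mohd Zin": {1, 2, 3, 4, 5, 6},
    "network": {2, 3, 4, 9},
    "minister": {5, 6, 7, 8, 10, 11},
}


def overlap(index, t_a, t_b):
    """Pages containing both terms, i.e. the intersection of Omega_a and Omega_b."""
    return index[t_a] & index[t_b]


def hit_count(index, *terms):
    """Cardinality of the intersection of the page sets of all given terms."""
    pages = set.intersection(*(index[t] for t in terms))
    return len(pages)


actor = "Abdullah Mohd Zin"
print(overlap(INDEX, actor, "network"))     # pages shared with "network"
print(hit_count(INDEX, actor, "network"))   # doubleton hit count
print(hit_count(INDEX, actor, "minister"))  # doubleton hit count
```

The toy sets were chosen so that the term with the smaller page set ("network") has the larger overlap with the actor, which is the situation that condition (b) describes.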

Let us model the bag of words (BoW) of snippets using TF.IDF (term
frequency-inverse document frequency) weights [14]:
$\mathrm{TF.IDF} = tf(t_x, S) \cdot idf(t_x)$, with
$tf(t_x, S) = \sum_{j=1,\ldots,N} \sum_{i=1,\ldots,m} 1/max$ and
$idf(t_x) = \log N/df(t_x)$, where $max$ is the number of terms/words in a
snippet, $m$ is the number of occurrences of the same term/word in a snippet,
$N$ is the number of snippets containing the name of the actor, and
$df(t_x)$ is the number of snippets in which the term $t_x$ appears. Then we
produce a vector space by $v = \mathrm{TF.IDF}/hs \in [0, 1]$, where $hs$ is
the highest TF.IDF score, such that $v_i \in [0, 1]$, $i = 1, \ldots, n$, and
$v_1 > v_2 > \ldots > v_n$. Thus, there is a one-to-one function
$w \to \{v_i \mid i = 1, \ldots, n\}$, for $w$ as a set of vocabularies. A
micro cluster of words $w$ [1] is a words network $G_w$ built from the Web by
applying the Jaccard coefficient to hit counts such that

$$\mathrm{sim}(t_x, t_y) = \frac{|\Omega_x \cap \Omega_y|}{|\Omega_x| + |\Omega_y| - |\Omega_x \cap \Omega_y|},$$

and the optimal micro cluster $T_w$ is a words tree; it is used for
eliminating irrelevant words from the list of candidates, or it can represent
the same actor. Thus there is a subset of words
$w_D = \{w_i \mid i = 1, 2, \ldots, m\}$, $m \le n$, and $w_D$ is a subset of
$w$. However, any word in the tree $w_D$ may still be characteristic of two or
more actors. We model a vector space $u_i$ of the words tree by dividing all
hit counts by the highest hit count: there is
$\rho : w_D \to \{u_i \mid i = 1, 2, \ldots, m\}$. Based on conditions (a) and
(b) of the overlap principle we define $\delta_i = v_i - u_i$ for selecting a
keyword $t_x$ from the optimal micro cluster, i.e. for
$\delta_x > \ldots > \delta_y$, $t_x$ is the keyword. The steps for extracting
keywords are as follows.

Table 1. An optimal micro cluster of an actor: "Abdullah Mohd Zin"

Words           v         u         $\delta$
network         1.00000   0.32680    0.67320
international   0.57339   0.49485    0.07855
computer        0.56474   0.48814    0.07660
system          0.52506   0.52577   -0.00072
software        0.50420   0.68041   -0.17621
use             0.40142   1.00000   -0.59858


Generate(keyword)
INPUT : A set of actors
OUTPUT : keyword(s) of each actor
STEPS :

1. $w = \{w_i \mid i = 1, 2, \ldots, n\}$ $\leftarrow$ Collect terms/words per
actor from snippets.

2. $\{v_i \mid i = 1, 2, \ldots, n\}$ $\leftarrow$ Generate a vector for all
$w_i \in w$ by using TF.IDF.

3. $\{u_i \mid i = 1, 2, \ldots, n\}$ $\leftarrow$ Generate a vector from the
hit count of each $w_i \in w$.

4. $G_w$ $\leftarrow$ Build the micro cluster using hit counts.

5. $T_w$ $\leftarrow$ Make the optimal micro cluster based on $G_w$.

6. If $T_w$ does not consist of trees, then collect and cut the nodes with
degree deg > 1 to separate $T_w$ into trees.

7. Select a cluster from the trees of $T_w$ by using a predefined stable
attribute.

8. Find the maximum $\delta$ among the candidate keywords in a cluster to
generate the keyword.
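
Steps 2 to 4 rest on two small computations, the normalized TF.IDF vector and the Jaccard coefficient on hit counts, which can be sketched in Python as follows. This is a sketch under assumptions, not the authors' implementation: snippets are assumed to be already tokenized into word lists, $N$ is taken to be the number of snippets passed in, and the function names `tf_idf_vector` and `jaccard` and the toy snippets are hypothetical.

```python
import math


def tf_idf_vector(snippets):
    """Return {word: TF.IDF / highest TF.IDF} for a list of tokenized snippets.

    tf sums, over all snippets, the occurrences of a word divided by the
    snippet length; idf is log(N / df), with N the number of snippets.
    """
    n = len(snippets)
    tf, df = {}, {}
    for snippet in snippets:
        if not snippet:
            continue
        for word in snippet:
            tf[word] = tf.get(word, 0.0) + 1.0 / len(snippet)
        for word in set(snippet):
            df[word] = df.get(word, 0) + 1
    scores = {word: tf[word] * math.log(n / df[word]) for word in tf}
    highest = max(scores.values(), default=0.0)
    if highest <= 0:
        # Every word occurs in every snippet, so all idf values are zero.
        return {word: 0.0 for word in scores}
    return {word: score / highest for word, score in scores.items()}


def jaccard(hits_x, hits_y, hits_xy):
    """Jaccard coefficient from singleton hit counts and the doubleton hit count."""
    union = hits_x + hits_y - hits_xy
    return hits_xy / union if union else 0.0


# Toy snippets (hypothetical), already split into words.
snippets = [
    ["network", "computer", "network"],
    ["network", "minister"],
    ["computer", "system"],
]
v = tf_idf_vector(snippets)
for word, value in sorted(v.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {value:.4f}")

print(jaccard(4, 6, 2))
```

Note that a word appearing in every snippet gets an idf of zero under this formula, so in practice the actor's own name tokens drop out of the candidate list.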



4 Experiment 

Let us consider the information context of actors, which includes all relevant
relationships with their interaction history. Yahoo! search engines fall short
of utilizing any specific information, especially micro cluster information,
and therefore we use full-text index search in web snippets. In the
experiment, we use a maximum of 500 web snippets for the search term $t_a$
representing an actor, and we consider words whose TF.IDF value is
> 0.3 x the highest TF.IDF value, up to a maximum of 30 words. For example,
w = {network, minister, Malaysia, journal, datuk, department, Allah,
international, Ismail, Nazri, computer, prime, ictac, learning, system,
software, foxley, said, kebangsaan, performance, dr, university, Eric, use,
accuracy, dblp, based, communications, utilization, author} is a set of 30
words from web snippets for the actor "Abdullah Mohd Zin". We test 143 names
and obtain 8 (5.59%) actors without a cluster of candidate words, 13 (9.09%)
actors with only one cluster, and 122 (85.32%) actors with two or more
keywords. In the case of "Abdullah Mohd Zin" we have {network, international,
computer, system, software, use}, {Malaysia, accuracy}, {datuk, Nazri,
kebangsaan}, {minister, journal, ictac, dblp, communications, utilization},
{department, learning, said, performance}, {dr, university, based}, {prime,
foxley, Eric, author}, {Allah, Ismail} as micro clusters of words. We can
arrange the individual keywords according to their proximity to the stable
attribute "academic", i.e. a set of words SK = {sciences, faculty, associate,
economic, prof, environment, career, journal, network, university, report,
relationship, context, ...}. SK and the maximum $\delta$ together determine
that "network" is the keyword for the actor "Abdullah Mohd Zin" as an academic
(not a politician); see Table 1. Therefore, we can redefine $\phi$ or $\psi$
as query($t_a$, $t_x$).
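
The selection of "network" can be reproduced from the values reported in Table 1. The short Python sketch below copies the $v$ and $u$ columns from that table and picks the word with the largest difference $v - u$; the name `select_keyword` is a hypothetical helper, and the cluster is assumed to have been chosen already by the stable attribute.

```python
# (word, v, u) rows copied from Table 1 for the actor "Abdullah Mohd Zin".
TABLE_1 = [
    ("network", 1.00000, 0.32680),
    ("international", 0.57339, 0.49485),
    ("computer", 0.56474, 0.48814),
    ("system", 0.52506, 0.52577),
    ("software", 0.50420, 0.68041),
    ("use", 0.40142, 1.00000),
]


def select_keyword(rows):
    """Return the (word, delta) pair with the largest delta = v - u."""
    deltas = [(word, v - u) for word, v, u in rows]
    return max(deltas, key=lambda pair: pair[1])


keyword, delta = select_keyword(TABLE_1)
print(keyword, round(delta, 5))
```

Recomputing the differences from the rounded columns can differ from the printed delta column in the last digit, which suggests the table was computed from unrounded values.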

Measured by average recall, precision and F-measure, the method shows one
point to consider: the number of words in a cluster should be limited so that
the average value of each measure is not pulled down by the lower values; see
Table 2.

Table 2. Average of optimal micro cluster results

Method             Recall   Precision   F-measure
Delta ($\delta$)   45.8%    29.5%       35.9%



Some keywords produce high recall and low precision, and the pairs of an actor
with a double keyword produce low recall and low precision, because web pages
about an actor have greater variety.
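
As a consistency check on Table 2, the F-measure can be recomputed from the reported recall and precision. The sketch below assumes the usual balanced F-measure, i.e. the harmonic mean of recall and precision; the text does not state which variant was used, so this definition is an assumption.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Average recall and precision (in percent) as reported in Table 2.
f = f_measure(45.8, 29.5)
print(round(f, 1))
```

Under this assumption the recomputed value rounds to the 35.9% given in Table 2.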

5 Conclusion and Future Work 

This article presents a practical methodology for extracting social networks
based on superficial methods (in the unsupervised research stream), with
quality measures for clustering the actors and the relations between them
using keywords (singleton and doubleton) and URL addresses, for which we
obtain the scale $< n(n-1)/2$. The social network extraction process relies on
dynamic knowledge on the Web; thus this work is still under development, and
the research must be completed and enriched. In particular, a careful next
study should examine the measures of the superficial methods, both on their
own and compared with each other, mainly with regard to complexity and the
number of submitted queries.